Marks: 60
The stock market has consistently proven to be a good place to invest and save for the future. There are many compelling reasons to invest in stocks: they help fight inflation, create wealth, and offer certain tax benefits. Steady returns compounded over a long period can grow far more than seems possible, and thanks to the power of compounding, the earlier one starts investing, the larger the corpus one can build for retirement. Overall, investing in stocks can help meet life's financial aspirations.
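The effect of compounding can be made concrete with a short calculation (the principal, rate, and time horizons below are purely illustrative, not drawn from the data):

```python
def future_value(principal, annual_rate, years):
    """Future value with annual compounding: FV = P * (1 + r) ** n."""
    return principal * (1 + annual_rate) ** years

# The same one-time investment of 10,000 at 8% per year:
early = future_value(10_000, 0.08, 40)  # invested 40 years before retirement
late = future_value(10_000, 0.08, 20)   # invested 20 years before retirement
print(f"40 years: {early:,.0f}  |  20 years: {late:,.0f}")
```

Doubling the time horizon here more than quadruples the final corpus, which is why starting early matters so much.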
It is important to maintain a diversified portfolio when investing in stocks in order to maximise earnings under any market condition. A diversified portfolio tends to yield higher returns and carry lower risk by tempering potential losses when the market is down. It is easy to get lost in the sea of financial metrics used to assess a stock's worth, and repeating that analysis across a multitude of stocks to identify the right picks for an individual can be tedious. Cluster analysis can identify stocks that exhibit similar characteristics as well as ones that exhibit minimal correlation. This helps investors analyze stocks across different market segments and protects against risks that could leave the portfolio vulnerable to losses.
Trade&Ahead is a financial consultancy firm that provides its customers with personalized investment strategies. They have hired you as a Data Scientist and provided you with data comprising stock prices and some financial indicators for a few companies listed on the New York Stock Exchange. Your tasks are to analyze the data, group the stocks based on the attributes provided, and share insights about the characteristics of each group.
# Libraries to help with reading and manipulating data
import numpy as np
import pandas as pd
# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
# to scale the data using z-score
from sklearn.preprocessing import StandardScaler
# to compute distances
from scipy.spatial.distance import cdist, pdist
# to perform k-means clustering and compute silhouette scores
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
# to perform hierarchical clustering, compute cophenetic correlation, and create dendrograms
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage, cophenet
# to perform PCA
from sklearn.decomposition import PCA
# to visualize the elbow curve and silhouette scores
from yellowbrick.cluster import KElbowVisualizer, SilhouetteVisualizer
# format numeric data for easier readability
pd.set_option(
    "display.float_format", lambda x: "%.2f" % x
)  # to display numbers rounded off to 2 decimal places
# Removes the limit for the number of displayed columns
pd.set_option("display.max_columns", None)
# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 200)
# To suppress warnings
import warnings
warnings.filterwarnings("ignore")
# function to create labeled barplots
def labeled_barplot(data, feature, perc=False, n=None):
    """
    Barplot with percentage at the top

    data: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of count (default is False)
    n: displays the top n category levels (default is None, i.e., display all levels)
    """
    total = len(data[feature])  # length of the column
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 1, 5))
    else:
        plt.figure(figsize=(n + 1, 5))

    plt.xticks(rotation=90, fontsize=12)
    ax = sns.countplot(
        data=data,
        x=feature,
        palette="viridis",
        order=data[feature].value_counts().index[:n].sort_values(),
    )

    for p in ax.patches:
        if perc:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  # percentage of each class of the category
        else:
            label = p.get_height()  # count of each level of the category

        x = p.get_x() + p.get_width() / 2  # center of the bar
        y = p.get_height()  # height of the bar

        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  # annotate the count/percentage above the bar

    plt.show()  # show the plot
# function to plot a boxplot and a histogram along the same scale
def histogram_boxplot(data, feature, figsize=(16, 6), kde=False, bins=None):
    """
    Combines boxplot and histogram

    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (16,6))
    kde: whether to show the density curve (default False)
    bins: number of bins for histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # number of rows of the subplot grid = 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True
    )  # boxplot will be created and a star will indicate the mean value of the column
    if bins:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins)
    else:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2)  # histogram
    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="--"
    )  # add mean to the histogram
    ax_hist2.axvline(
        data[feature].median(), color="black", linestyle="-"
    )  # add median to the histogram
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
# read the csv file into a pandas DataFrame
data = pd.read_csv('/content/drive/MyDrive/Data Science- GL/7- Unsupervised Learning/Project 7 /stock_data.csv')
# copying data to another variable to preserve original data
df = data.copy()
# print the first five rows of the data
df.head()
| | Ticker Symbol | Security | GICS Sector | GICS Sub Industry | Current Price | Price Change | Volatility | ROE | Cash Ratio | Net Cash Flow | Net Income | Earnings Per Share | Estimated Shares Outstanding | P/E Ratio | P/B Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AAL | American Airlines Group | Industrials | Airlines | 42.35 | 10.00 | 1.69 | 135 | 51 | -604000000 | 7610000000 | 11.39 | 668129938.50 | 3.72 | -8.78 |
| 1 | ABBV | AbbVie | Health Care | Pharmaceuticals | 59.24 | 8.34 | 2.20 | 130 | 77 | 51000000 | 5144000000 | 3.15 | 1633015873.00 | 18.81 | -8.75 |
| 2 | ABT | Abbott Laboratories | Health Care | Health Care Equipment | 44.91 | 11.30 | 1.27 | 21 | 67 | 938000000 | 4423000000 | 2.94 | 1504421769.00 | 15.28 | -0.39 |
| 3 | ADBE | Adobe Systems Inc | Information Technology | Application Software | 93.94 | 13.98 | 1.36 | 9 | 180 | -240840000 | 629551000 | 1.26 | 499643650.80 | 74.56 | 4.20 |
| 4 | ADI | Analog Devices, Inc. | Information Technology | Semiconductors | 55.32 | -1.83 | 1.70 | 14 | 272 | 315120000 | 696878000 | 0.31 | 2247993548.00 | 178.45 | 1.06 |
df.sample(n=5)
| | Ticker Symbol | Security | GICS Sector | GICS Sub Industry | Current Price | Price Change | Volatility | ROE | Cash Ratio | Net Cash Flow | Net Income | Earnings Per Share | Estimated Shares Outstanding | P/E Ratio | P/B Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 29 | AON | Aon plc | Financials | Insurance Brokers | 92.21 | 3.91 | 1.11 | 23 | 99 | 10000000 | 1385000000 | 4.93 | 280933062.90 | 18.70 | -7.76 |
| 66 | CINF | Cincinnati Financial | Financials | Property & Casualty Insurance | 59.17 | 9.78 | 0.94 | 10 | 99 | -47000000 | 634000000 | 3.87 | 163824289.40 | 15.29 | -4.33 |
| 311 | VMC | Vulcan Materials | Materials | Construction Materials | 94.97 | 6.03 | 1.85 | 5 | 81 | 142787000 | 221177000 | 1.66 | 133239156.60 | 57.21 | -0.27 |
| 320 | WFC | Wells Fargo | Financials | Banks | 54.36 | 5.53 | 0.97 | 12 | 99 | -460000000 | 22894000000 | 4.18 | 5477033493.00 | 13.00 | -0.94 |
| 312 | VNO | Vornado Realty Trust | Real Estate | REITs | 99.96 | 10.03 | 1.02 | 11 | 47 | 637230000 | 760434000 | 3.61 | 210646537.40 | 27.69 | -1.08 |
df.shape
(340, 15)
# print the data types of the columns within the dataset
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 340 entries, 0 to 339
Data columns (total 15 columns):
 #   Column                        Non-Null Count  Dtype
---  ------                        --------------  -----
 0   Ticker Symbol                 340 non-null    object
 1   Security                      340 non-null    object
 2   GICS Sector                   340 non-null    object
 3   GICS Sub Industry             340 non-null    object
 4   Current Price                 340 non-null    float64
 5   Price Change                  340 non-null    float64
 6   Volatility                    340 non-null    float64
 7   ROE                           340 non-null    int64
 8   Cash Ratio                    340 non-null    int64
 9   Net Cash Flow                 340 non-null    int64
 10  Net Income                    340 non-null    int64
 11  Earnings Per Share            340 non-null    float64
 12  Estimated Shares Outstanding  340 non-null    float64
 13  P/E Ratio                     340 non-null    float64
 14  P/B Ratio                     340 non-null    float64
dtypes: float64(7), int64(4), object(4)
memory usage: 40.0+ KB
# checking for duplicate values
df.duplicated().sum()
0
# convert all columns with dtype object into category
for col in df.columns[df.dtypes == 'object']:
    df[col] = df[col].astype('category')
# dropping the ticker symbol column, as it does not provide any information
df.drop("Ticker Symbol", axis=1, inplace=True)
# confirm new dataset
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 340 entries, 0 to 339
Data columns (total 14 columns):
 #   Column                        Non-Null Count  Dtype
---  ------                        --------------  -----
 0   Security                      340 non-null    category
 1   GICS Sector                   340 non-null    category
 2   GICS Sub Industry             340 non-null    category
 3   Current Price                 340 non-null    float64
 4   Price Change                  340 non-null    float64
 5   Volatility                    340 non-null    float64
 6   ROE                           340 non-null    int64
 7   Cash Ratio                    340 non-null    int64
 8   Net Cash Flow                 340 non-null    int64
 9   Net Income                    340 non-null    int64
 10  Earnings Per Share            340 non-null    float64
 11  Estimated Shares Outstanding  340 non-null    float64
 12  P/E Ratio                     340 non-null    float64
 13  P/B Ratio                     340 non-null    float64
dtypes: category(3), float64(7), int64(4)
memory usage: 46.7 KB
Questions:
#provide statistical summary of all categorical columns
df.describe(include='category').T
| | count | unique | top | freq |
|---|---|---|---|---|
| Security | 340 | 340 | 3M Company | 1 |
| GICS Sector | 340 | 11 | Industrials | 53 |
| GICS Sub Industry | 340 | 104 | Oil & Gas Exploration & Production | 16 |
#create labeled barplot of stocks by sector
labeled_barplot(df, 'GICS Sector')
#display the five sectors with the most number of stocks
df["GICS Sector"].value_counts().head(n=5)
Industrials               53
Financials                49
Consumer Discretionary    40
Health Care               40
Information Technology    33
Name: GICS Sector, dtype: int64
#create labeled barplot of stocks by sub industry
labeled_barplot(df, 'GICS Sub Industry')
#display the five sub industries with the most number of stocks
df['GICS Sub Industry'].value_counts().head(n=5)
Oil & Gas Exploration & Production    16
REITs                                 14
Industrial Conglomerates              14
Internet Software & Services          12
Electric Utilities                    12
Name: GICS Sub Industry, dtype: int64
#provide statistical summary of all numerical columns
df.describe().T
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Current Price | 340.00 | 80.86 | 98.06 | 4.50 | 38.55 | 59.70 | 92.88 | 1274.95 |
| Price Change | 340.00 | 4.08 | 12.01 | -47.13 | -0.94 | 4.82 | 10.70 | 55.05 |
| Volatility | 340.00 | 1.53 | 0.59 | 0.73 | 1.13 | 1.39 | 1.70 | 4.58 |
| ROE | 340.00 | 39.60 | 96.55 | 1.00 | 9.75 | 15.00 | 27.00 | 917.00 |
| Cash Ratio | 340.00 | 70.02 | 90.42 | 0.00 | 18.00 | 47.00 | 99.00 | 958.00 |
| Net Cash Flow | 340.00 | 55537620.59 | 1946365312.18 | -11208000000.00 | -193906500.00 | 2098000.00 | 169810750.00 | 20764000000.00 |
| Net Income | 340.00 | 1494384602.94 | 3940150279.33 | -23528000000.00 | 352301250.00 | 707336000.00 | 1899000000.00 | 24442000000.00 |
| Earnings Per Share | 340.00 | 2.78 | 6.59 | -61.20 | 1.56 | 2.90 | 4.62 | 50.09 |
| Estimated Shares Outstanding | 340.00 | 577028337.75 | 845849595.42 | 27672156.86 | 158848216.10 | 309675137.80 | 573117457.32 | 6159292035.00 |
| P/E Ratio | 340.00 | 32.61 | 44.35 | 2.94 | 15.04 | 20.82 | 31.76 | 528.04 |
| P/B Ratio | 340.00 | -1.72 | 13.97 | -76.12 | -4.35 | -1.07 | 3.92 | 129.06 |
Numerical Columns
What does the distribution of stock prices look like?
#create list of columns with numerical variables
num_col = df.select_dtypes(include=np.number).columns.tolist()
#display histograms and boxplots for all numerical columns
for col in num_col:
    histogram_boxplot(df, col)
Current price
Price change
Volatility
Cash Ratio / ROE
Net Income / EPS
Estimated shares outstanding
P/E and P/B Ratio
The stocks of which economic sector have seen the maximum price increase on average?
df.groupby('GICS Sector')['Price Change'].mean().sort_values()
GICS Sector
Energy                        -10.23
Utilities                       0.80
Industrials                     2.83
Financials                      3.87
Materials                       5.59
Consumer Discretionary          5.85
Real Estate                     6.21
Telecommunications Services     6.96
Information Technology          7.22
Consumer Staples                8.68
Health Care                     9.59
Name: Price Change, dtype: float64
How are the different variables correlated with each other?
#create correlation heat map for numerical variables
plt.figure(figsize=(14, 7))
sns.heatmap(
df[num_col].corr(),
annot=True,
vmin=-1,
vmax=1,
fmt=".2f",
cmap='viridis'
)
plt.show()
Cash ratio provides a measure of a company's ability to cover its short-term obligations using only cash and cash equivalents. How does the average cash ratio vary across economic sectors?
df.groupby('GICS Sector')['Cash Ratio'].mean().sort_values(ascending=False)
GICS Sector
Information Technology         149.82
Telecommunications Services    117.00
Health Care                    103.78
Financials                      98.59
Consumer Staples                70.95
Energy                          51.13
Real Estate                     50.11
Consumer Discretionary          49.58
Materials                       41.70
Industrials                     36.19
Utilities                       13.62
Name: Cash Ratio, dtype: float64
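As a reminder of the definition behind this column: the cash ratio divides cash and cash equivalents by current liabilities. The raw components are not part of this dataset, so the figures below are hypothetical:

```python
def cash_ratio(cash_and_equivalents, current_liabilities):
    """Cash ratio = (cash + cash equivalents) / current liabilities."""
    return cash_and_equivalents / current_liabilities

# A hypothetical firm holding 50M in cash against 100M of current liabilities
# can cover half of its short-term obligations with cash alone:
print(cash_ratio(50_000_000, 100_000_000))  # 0.5
```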
P/E ratios can help determine the relative value of a company's shares as they signify the amount of money an investor is willing to invest in a single share of a company per dollar of its earnings. How does the P/E ratio vary, on average, across economic sectors?
df.groupby('GICS Sector')['P/E Ratio'].mean().sort_values(ascending=False)
GICS Sector
Energy                         72.90
Information Technology         43.78
Real Estate                    43.07
Health Care                    41.14
Consumer Discretionary         35.21
Consumer Staples               25.52
Materials                      24.59
Utilities                      18.72
Industrials                    18.26
Financials                     16.02
Telecommunications Services    12.22
Name: P/E Ratio, dtype: float64
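Since the data contains both Current Price and Earnings Per Share, the reported P/E column can be sanity-checked against its definition, P/E = price per share / EPS. A quick check on two rows copied from the sample output above (AAL and ADBE; note the ratio is not meaningful for firms with negative or zero earnings):

```python
import pandas as pd

# Two rows from the df.head() output above (values copied from the sample)
toy = pd.DataFrame({
    "Current Price": [42.35, 93.94],
    "Earnings Per Share": [11.39, 1.26],
})
toy["P/E (recomputed)"] = toy["Current Price"] / toy["Earnings Per Share"]
print(toy["P/E (recomputed)"].round(2).tolist())  # [3.72, 74.56], matching the reported column
```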
df.duplicated().sum()
0
df.isna().sum()
Security                        0
GICS Sector                     0
GICS Sub Industry               0
Current Price                   0
Price Change                    0
Volatility                      0
ROE                             0
Cash Ratio                      0
Net Cash Flow                   0
Net Income                      0
Earnings Per Share              0
Estimated Shares Outstanding    0
P/E Ratio                       0
P/B Ratio                       0
dtype: int64
# Select only numerical columns
numerical_columns = df.select_dtypes(include=np.number).columns.tolist()
plt.figure(figsize=(20, 50))
for i, variable in enumerate(numerical_columns):
    plt.subplot(14, 3, i + 1)
    sns.boxplot(data=df, x=variable, orient="h")
    plt.tight_layout()
    plt.title(variable)
plt.show()
#scale the data set before clustering
scaler = StandardScaler()
subset = df[num_col].copy()
subset_scaled = scaler.fit_transform(subset)
#create a dataframe from the scaled data
subset_scaled_df = pd.DataFrame(subset_scaled, columns=subset.columns)
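A quick sanity check: after z-scoring, every column should have (approximately) zero mean and unit standard deviation. The snippet below demonstrates this on synthetic data; the same check can be run on `subset_scaled_df`:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=50, scale=10, size=(100, 3))  # synthetic stand-in for the numeric columns
Z = StandardScaler().fit_transform(X)
print(np.allclose(Z.mean(axis=0), 0), np.allclose(Z.std(axis=0), 1))  # True True
```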
#create pairplot for scaled dataframe
sns.pairplot(subset_scaled_df, height=2,aspect=2 , diag_kind='kde')
plt.show()
#print average distortions for range of kmeans models fitted to scaled dataset
clusters = range(1, 11)
meanDistortions = []
for k in clusters:
    model = KMeans(n_clusters=k, random_state=42)  # fixed seed for reproducible centroids
    model.fit(subset_scaled_df)
    prediction = model.predict(subset_scaled_df)
    distortion = (
        sum(
            np.min(cdist(subset_scaled_df, model.cluster_centers_, "euclidean"), axis=1)
        )
        / subset_scaled_df.shape[0]
    )
    meanDistortions.append(distortion)
    print("Number of Clusters:", k, "\tAverage Distortion:", distortion)
Number of Clusters: 1 	Average Distortion: 2.5425069919221697
Number of Clusters: 2 	Average Distortion: 2.382318498894466
Number of Clusters: 3 	Average Distortion: 2.2683105560042285
Number of Clusters: 4 	Average Distortion: 2.175554082632614
Number of Clusters: 5 	Average Distortion: 2.1413259454317446
Number of Clusters: 6 	Average Distortion: 2.053350668955699
Number of Clusters: 7 	Average Distortion: 2.013058832832641
Number of Clusters: 8 	Average Distortion: 1.9732145498041436
Number of Clusters: 9 	Average Distortion: 1.9257827631867823
Number of Clusters: 10 	Average Distortion: 1.8779730119595663
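The elbow is usually judged visually, but one simple heuristic (an illustration, not part of the original analysis) is to find the k at which the curve bends most sharply, i.e. the largest second difference of the distortions:

```python
import numpy as np

# Average distortions printed above (rounded), for k = 1..10
distortions = [2.5425, 2.3823, 2.2683, 2.1756, 2.1413,
               2.0534, 2.0131, 1.9732, 1.9258, 1.8780]
ks = np.arange(1, 11)

bend = np.diff(distortions, 2)       # second difference at each interior k
elbow_k = ks[1:-1][np.argmax(bend)]  # k where the decrease slows the most
print(elbow_k)
```

This heuristic is only a rough guide; the analysis below cross-checks with silhouette scores before settling on the final cluster count.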
# fit KMeans model and use visualizer to indicate the optimal K value
model = KMeans(random_state=42)
visualizer = KElbowVisualizer(model, k=(1, 11), timings=True)
visualizer.fit(subset_scaled_df) # fit the data to the visualizer
visualizer.show() # finalize and render figure
plt.show()
#fit KMeans model and provide silhouette scores for range of k clusters
sil_score = []
cluster_list = range(2, 11)
for n_clusters in cluster_list:
    clusterer = KMeans(n_clusters=n_clusters, random_state=42)
    preds = clusterer.fit_predict(subset_scaled_df)
    score = silhouette_score(subset_scaled_df, preds)
    sil_score.append(score)
    print("For n_clusters = {}, the silhouette score is {}".format(n_clusters, score))
#show scores in line graph
plt.plot(cluster_list, sil_score)
plt.show()
For n_clusters = 2, the silhouette score is 0.43969639509980457
For n_clusters = 3, the silhouette score is 0.45797710447228496
For n_clusters = 4, the silhouette score is 0.45017906939331087
For n_clusters = 5, the silhouette score is 0.4599352800740646
For n_clusters = 6, the silhouette score is 0.3985379248608659
For n_clusters = 7, the silhouette score is 0.3868475076242907
For n_clusters = 8, the silhouette score is 0.3886929719130642
For n_clusters = 9, the silhouette score is 0.40581042332267614
For n_clusters = 10, the silhouette score is 0.18011528994705786
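For reference, the silhouette coefficient of a point is s = (b − a) / max(a, b), where a is its mean distance to points in its own cluster and b its mean distance to the nearest other cluster; values near 1 indicate tight, well-separated clusters. A toy check on synthetic data (not the stock set):

```python
import numpy as np
from sklearn.metrics import silhouette_score

# Two perfectly separated blobs: a = 0 within each blob, so s = b / b = 1
X = np.vstack([np.zeros((10, 2)), np.full((10, 2), 10.0)])
labels = np.array([0] * 10 + [1] * 10)
print(silhouette_score(X, labels))  # 1.0
```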
# fit KMeans model and use visualizer to indicate the optimal K value
model = KMeans(random_state=42)
visualizer = KElbowVisualizer(model, k=(2, 11), metric="silhouette", timings=True)
visualizer.fit(subset_scaled_df) # fit the data to the visualizer
visualizer.show() # finalize and render figure
plt.show()
#find optimal no. of clusters with silhouette coefficients
visualizer = SilhouetteVisualizer(KMeans(5, random_state=42))
visualizer.fit(subset_scaled_df)
visualizer.show()
plt.show()
#create kmeans cluster model
kmeans = KMeans(n_clusters=5, random_state=42)
#fit model to scaled dataset
kmeans.fit(subset_scaled_df)
KMeans(n_clusters=5, random_state=42)
# adding kmeans cluster labels to the original dataframe
df["KMeans_clusters"] = kmeans.labels_
#group dataset by kmeans cluster labels
cluster_profile = df.groupby("KMeans_clusters").mean(numeric_only=True)
#add counts for number of stocks in each cluster
cluster_profile["Count"] = (
df.groupby("KMeans_clusters")["Current Price"].count().values
)
cluster_profile.style.highlight_max(color="lightblue", axis=0)
| KMeans_clusters | Current Price | Price Change | Volatility | ROE | Cash Ratio | Net Cash Flow | Net Income | Earnings Per Share | Estimated Shares Outstanding | P/E Ratio | P/B Ratio | Count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 50.477272 | 5.588148 | 1.141171 | 31.545455 | 64.181818 | -2581727272.727273 | 14675545454.545454 | 4.490909 | 4012176129.000000 | 14.010093 | -5.356393 | 11 |
| 1 | 81.418719 | 10.536341 | 1.578634 | 17.000000 | 367.538462 | 3857062692.307693 | 3129067846.153846 | 2.197692 | 1334755181.584615 | 65.639418 | 6.235848 | 13 |
| 2 | 73.769121 | 5.466467 | 1.392827 | 34.632143 | 55.710714 | 4183132.142857 | 1443269353.571429 | 3.629625 | 430217149.035393 | 24.132318 | -3.203999 | 280 |
| 3 | 38.808966 | -13.680395 | 2.938240 | 106.034483 | 55.551724 | -189825655.172414 | -3578126517.241379 | -8.657586 | 463121182.880690 | 85.946813 | 1.888901 | 29 |
| 4 | 585.527134 | 7.752090 | 1.508020 | 17.571429 | 159.142857 | 210520428.571429 | 804590428.571429 | 14.410000 | 116080574.760000 | 118.763084 | 33.713020 | 7 |
# print the names of the companies in each cluster
for cl in df["KMeans_clusters"].unique():
    print("In cluster {}, the following companies are present:".format(cl))
    print(list(df[df["KMeans_clusters"] == cl]["Security"].unique()))
    print()
In cluster 2, the following companies are present: ['American Airlines Group', 'AbbVie', 'Abbott Laboratories', 'Adobe Systems Inc', 'Archer-Daniels-Midland Co', 'Ameren Corp', 'American Electric Power', 'AFLAC Inc', 'American International Group, Inc.', 'Apartment Investment & Mgmt', 'Assurant Inc', 'Arthur J. Gallagher & Co.', 'Akamai Technologies Inc', 'Albemarle Corp', 'Alaska Air Group Inc', 'Allstate Corp', 'Allegion', 'Applied Materials Inc', 'AMETEK Inc', 'Affiliated Managers Group Inc', 'Ameriprise Financial', 'American Tower Corp A', 'AutoNation Inc', 'Anthem Inc.', 'Aon plc', 'Amphenol Corp', 'Arconic Inc', 'Activision Blizzard', 'AvalonBay Communities, Inc.', 'Broadcom', 'American Water Works Company Inc', 'American Express Co', 'Boeing Company', 'Baxter International Inc.', 'BB&T Corporation', 'Bard (C.R.) Inc.', 'BIOGEN IDEC Inc.', 'The Bank of New York Mellon Corp.', 'Ball Corp', 'Bristol-Myers Squibb', 'Boston Scientific', 'BorgWarner', 'Boston Properties', 'Caterpillar Inc.', 'Chubb Limited', 'CBRE Group', 'Crown Castle International Corp.', 'Carnival Corp.', 'CF Industries Holdings Inc', 'Citizens Financial Group', 'Church & Dwight', 'C. H. 
Robinson Worldwide', 'Charter Communications', 'CIGNA Corp.', 'Cincinnati Financial', 'Colgate-Palmolive', 'Comerica Inc.', 'CME Group Inc.', 'Cummins Inc.', 'CMS Energy', 'Centene Corporation', 'CenterPoint Energy', 'Capital One Financial', 'The Cooper Companies', 'CSX Corp.', 'CenturyLink Inc', 'Cognizant Technology Solutions', 'Citrix Systems', 'CVS Health', 'Chevron Corp.', 'Dominion Resources', 'Delta Air Lines', 'Du Pont (E.I.)', 'Deere & Co.', 'Discover Financial Services', 'Quest Diagnostics', 'Danaher Corp.', 'The Walt Disney Company', 'Discovery Communications-A', 'Discovery Communications-C', 'Delphi Automotive', 'Digital Realty Trust', 'Dun & Bradstreet', 'Dover Corp.', 'Dr Pepper Snapple Group', 'Duke Energy', 'DaVita Inc.', 'eBay Inc.', 'Ecolab Inc.', 'Consolidated Edison', 'Equifax Inc.', "Edison Int'l", 'Eastman Chemical', 'Equity Residential', 'Eversource Energy', 'Essex Property Trust, Inc.', 'E*Trade', 'Eaton Corporation', 'Entergy Corp.', 'Edwards Lifesciences', 'Exelon Corp.', "Expeditors Int'l", 'Expedia Inc.', 'Extra Space Storage', 'Fastenal Co', 'Fortune Brands Home & Security', 'FirstEnergy Corp', 'Fidelity National Information Services', 'Fiserv Inc', 'FLIR Systems', 'Fluor Corp.', 'Flowserve Corporation', 'FMC Corporation', 'Federal Realty Investment Trust', 'First Solar Inc', 'General Dynamics', 'General Growth Properties Inc.', 'Corning Inc.', 'Genuine Parts', 'Garmin Ltd.', 'Goodyear Tire & Rubber', 'Grainger (W.W.) Inc.', 'Hasbro Inc.', 'Huntington Bancshares', 'HCA Holdings', 'Welltower Inc.', 'HCP Inc.', 'Hartford Financial Svc.Gp.', 'Harley-Davidson', "Honeywell Int'l Inc.", 'HP Inc.', 'Hormel Foods Corp.', 'Henry Schein', 'Host Hotels & Resorts', 'The Hershey Company', 'Humana Inc.', 'International Business Machines', 'IDEXX Laboratories', 'Intl Flavors & Fragrances', 'International Paper', 'Interpublic Group', 'Iron Mountain Incorporated', 'Illinois Tool Works', 'Invesco Ltd.', 'J. B. 
Hunt Transport Services', 'Jacobs Engineering Group', 'Juniper Networks', 'Kimco Realty', 'Kimberly-Clark', 'Kansas City Southern', 'Leggett & Platt', 'Lennar Corp.', 'Laboratory Corp. of America Holding', 'LKQ Corporation', 'L-3 Communications Holdings', 'Lilly (Eli) & Co.', 'Lockheed Martin Corp.', 'Alliant Energy Corp', 'Leucadia National Corp.', 'Southwest Airlines', 'Level 3 Communications', 'LyondellBasell', 'Mastercard Inc.', 'Mid-America Apartments', 'Macerich', "Marriott Int'l.", 'Masco Corp.', 'Mattel Inc.', "Moody's Corp", 'Mondelez International', 'MetLife Inc.', 'Mohawk Industries', 'Mead Johnson', 'McCormick & Co.', 'Martin Marietta Materials', 'Marsh & McLennan', '3M Company', 'Altria Group Inc', 'Marathon Petroleum', 'Merck & Co.', 'M&T Bank Corp.', 'Mettler Toledo', 'Mylan N.V.', 'Navient', 'NASDAQ OMX Group', 'NextEra Energy', 'Newmont Mining Corp. (Hldg. Co.)', 'Nielsen Holdings', 'Norfolk Southern Corp.', 'Northern Trust Corp.', 'Nucor Corp.', 'Newell Brands', 'Realty Income Corporation', 'Omnicom Group', "O'Reilly Automotive", "People's United Financial", 'Pitney-Bowes', 'PACCAR Inc.', 'PG&E Corp.', 'Public Serv. 
Enterprise Inc.', 'PepsiCo Inc.', 'Principal Financial Group', 'Procter & Gamble', 'Progressive Corp.', 'Pulte Homes Inc.', 'Philip Morris International', 'PNC Financial Services', 'Pentair Ltd.', 'Pinnacle West Capital', 'PPG Industries', 'PPL Corp.', 'Prudential Financial', 'Phillips 66', 'Praxair Inc.', 'PayPal', 'Ryder System', 'Royal Caribbean Cruises Ltd', 'Robert Half International', 'Roper Industries', 'Republic Services Inc', 'SCANA Corp', 'Charles Schwab Corporation', 'Spectra Energy Corp.', 'Sealed Air', 'Sherwin-Williams', 'SL Green Realty', 'Scripps Networks Interactive Inc.', 'Southern Co.', 'Simon Property Group Inc', 'S&P Global, Inc.', 'Stericycle Inc', 'Sempra Energy', 'SunTrust Banks', 'State Street Corp.', 'Skyworks Solutions', 'Synchrony Financial', 'Stryker Corp.', 'Molson Coors Brewing Company', 'Tegna, Inc.', 'Torchmark Corp.', 'Thermo Fisher Scientific', 'TripAdvisor', 'The Travelers Companies Inc.', 'Tractor Supply Company', 'Tyson Foods', 'Tesoro Petroleum Co.', 'Total System Services', 'Texas Instruments', 'Under Armour', 'United Continental Holdings', 'UDR Inc', 'Universal Health Services, Inc.', 'United Health Group Inc.', 'Unum Group', 'Union Pacific', 'United Parcel Service', 'United Technologies', 'Varian Medical Systems', 'Valero Energy', 'Vulcan Materials', 'Vornado Realty Trust', 'Verisk Analytics', 'Verisign Inc.', 'Vertex Pharmaceuticals Inc', 'Ventas Inc', 'Wec Energy Group Inc', 'Whirlpool Corp.', 'Waste Management Inc.', 'Western Union Co', 'Weyerhaeuser Corp.', 'Wyndham Worldwide', 'Xcel Energy Inc', 'XL Capital', 'Dentsply Sirona', 'Xerox Corp.', 'Xylem Inc.', 'Yum! 
Brands Inc', 'Zimmer Biomet Holdings', 'Zions Bancorp', 'Zoetis'] In cluster 1, the following companies are present: ['Analog Devices, Inc.', 'Alexion Pharmaceuticals', 'Amgen Inc', 'Bank of America Corp', 'Celgene Corp.', 'Facebook', 'Frontier Communications', 'Halliburton Co.', 'Intel Corp.', "McDonald's Corp.", 'Monster Beverage', 'Waters Corporation', 'Yahoo Inc.'] In cluster 4, the following companies are present: ['Alliance Data Systems', 'Amazon.com Inc', 'Chipotle Mexican Grill', 'Equinix', 'Intuitive Surgical Inc.', 'Priceline.com Inc', 'Regeneron'] In cluster 3, the following companies are present: ['Apache Corporation', 'Anadarko Petroleum Corp', 'Baker Hughes Inc', 'Chesapeake Energy', 'Cabot Oil & Gas', 'Concho Resources', 'Devon Energy Corp.', 'EOG Resources', 'EQT Corporation', 'Freeport-McMoran Cp & Gld', 'Hess Corporation', 'Hewlett Packard Enterprise', 'Kinder Morgan', 'The Mosaic Company', 'Marathon Oil Corp.', 'Murphy Oil', 'Noble Energy Inc', 'Netflix Inc.', 'Newfield Exploration Co', 'National Oilwell Varco Inc.', 'ONEOK', 'Occidental Petroleum', 'Quanta Services Inc.', 'Range Resources Corp.', 'Southwestern Energy', 'Teradata Corp.', 'Williams Cos.', 'Wynn Resorts Ltd', 'Cimarex Energy'] In cluster 0, the following companies are present: ['Citigroup Inc.', 'Ford Motor', 'Gilead Sciences', 'General Motors', 'JPMorgan Chase & Co.', 'Coca Cola Company', 'Pfizer Inc.', 'AT&T Inc', 'Verizon Communications', 'Wells Fargo', 'Exxon Mobil Corp.']
#print number of stocks within each sector for all of the clusters
for k in range(0, df['KMeans_clusters'].nunique()):
    print('The number of stocks within each GICS Sector for Cluster ' + str(k) + ' are:')
    print(df[df['KMeans_clusters'] == k]['GICS Sector'].value_counts())
    print(" ")
The number of stocks within each GICS Sector for Cluster 0 are:
Financials                     3
Consumer Discretionary         2
Health Care                    2
Telecommunications Services    2
Consumer Staples               1
Energy                         1
Industrials                    0
Information Technology         0
Materials                      0
Real Estate                    0
Utilities                      0
Name: GICS Sector, dtype: int64

The number of stocks within each GICS Sector for Cluster 1 are:
Health Care                    4
Information Technology         4
Consumer Discretionary         1
Consumer Staples               1
Energy                         1
Financials                     1
Telecommunications Services    1
Industrials                    0
Materials                      0
Real Estate                    0
Utilities                      0
Name: GICS Sector, dtype: int64

The number of stocks within each GICS Sector for Cluster 2 are:
Industrials                    52
Financials                     45
Consumer Discretionary         33
Health Care                    32
Real Estate                    26
Information Technology         25
Utilities                      24
Materials                      18
Consumer Staples               17
Energy                          6
Telecommunications Services     2
Name: GICS Sector, dtype: int64

The number of stocks within each GICS Sector for Cluster 3 are:
Energy                         22
Information Technology          3
Materials                       2
Consumer Discretionary          1
Industrials                     1
Consumer Staples                0
Financials                      0
Health Care                     0
Real Estate                     0
Telecommunications Services     0
Utilities                       0
Name: GICS Sector, dtype: int64

The number of stocks within each GICS Sector for Cluster 4 are:
Consumer Discretionary         3
Health Care                    2
Information Technology         1
Real Estate                    1
Consumer Staples               0
Energy                         0
Financials                     0
Industrials                    0
Materials                      0
Telecommunications Services    0
Utilities                      0
Name: GICS Sector, dtype: int64
# show boxplots of numerical variables for each K-Means cluster
fig, axes = plt.subplots(3, 4, figsize=(20, 20))
counter = 0
for ii in range(3):
    for jj in range(4):
        if counter < 11:
            sns.boxplot(
                ax=axes[ii][jj],
                data=df,
                y=df.columns[3 + counter],
                x="KMeans_clusters",
                palette="viridis",
            )
        counter = counter + 1
fig.tight_layout(pad=3.0)
Cluster 0 - Large Market Capitalization / Dow Jones Industrial Average
Cluster 1 - "Cash is King"
Cluster 2 - S&P 500 / Diversification
Cluster 3 - "Ride the Energy Rollercoaster" portfolio / Growth mindset
Cluster 4 - High Earnings for a High Price
# list of distance metrics
distance_metrics = ["euclidean", "chebyshev", "mahalanobis", "cityblock"]
# list of linkage methods
linkage_methods = ["single", "complete", "average", "weighted"]
high_cophenet_corr = 0
high_dm_lm = [0, 0]
for dm in distance_metrics:
    for lm in linkage_methods:
        Z = linkage(subset_scaled_df, metric=dm, method=lm)
        c, coph_dists = cophenet(Z, pdist(subset_scaled_df))
        print(
            "Cophenetic correlation for {} distance and {} linkage is {}.".format(
                dm.capitalize(), lm, round(c, 4)
            )
        )
        print(" ")
        if high_cophenet_corr < c:
            high_cophenet_corr = c
            high_dm_lm[0] = dm
            high_dm_lm[1] = lm
Cophenetic correlation for Euclidean distance and single linkage is 0.9232.
Cophenetic correlation for Euclidean distance and complete linkage is 0.7873.
Cophenetic correlation for Euclidean distance and average linkage is 0.9423.
Cophenetic correlation for Euclidean distance and weighted linkage is 0.8694.
Cophenetic correlation for Chebyshev distance and single linkage is 0.9063.
Cophenetic correlation for Chebyshev distance and complete linkage is 0.5989.
Cophenetic correlation for Chebyshev distance and average linkage is 0.9338.
Cophenetic correlation for Chebyshev distance and weighted linkage is 0.9127.
Cophenetic correlation for Mahalanobis distance and single linkage is 0.9259.
Cophenetic correlation for Mahalanobis distance and complete linkage is 0.7925.
Cophenetic correlation for Mahalanobis distance and average linkage is 0.9247.
Cophenetic correlation for Mahalanobis distance and weighted linkage is 0.8708.
Cophenetic correlation for Cityblock distance and single linkage is 0.9334.
Cophenetic correlation for Cityblock distance and complete linkage is 0.7375.
Cophenetic correlation for Cityblock distance and average linkage is 0.9302.
Cophenetic correlation for Cityblock distance and weighted linkage is 0.731.
# printing the combination of distance metric and linkage method with the highest cophenetic correlation
print(
    "Highest cophenetic correlation is {}, which is obtained with {} distance and {} linkage.".format(
        round(high_cophenet_corr, 4), high_dm_lm[0].capitalize(), high_dm_lm[1]
    )
)
Highest cophenetic correlation is 0.9423, which is obtained with Euclidean distance and average linkage.
# list of linkage methods for euclidean distance metric
linkage_methods = ["single", "complete", "average", "centroid", "ward", "weighted"]
high_cophenet_corr = 0
high_dm_lm = [0, 0]

for lm in linkage_methods:
    Z = linkage(subset_scaled_df, metric="euclidean", method=lm)
    c, coph_dists = cophenet(Z, pdist(subset_scaled_df))
    print(
        "Cophenetic correlation for Euclidean distance and {} linkage is {}.".format(
            lm, round(c, 4)
        )
    )
    print(" ")
    if high_cophenet_corr < c:
        high_cophenet_corr = c
        high_dm_lm[0] = "euclidean"
        high_dm_lm[1] = lm
Cophenetic correlation for Euclidean distance and single linkage is 0.9232.
Cophenetic correlation for Euclidean distance and complete linkage is 0.7873.
Cophenetic correlation for Euclidean distance and average linkage is 0.9423.
Cophenetic correlation for Euclidean distance and centroid linkage is 0.9314.
Cophenetic correlation for Euclidean distance and ward linkage is 0.7101.
Cophenetic correlation for Euclidean distance and weighted linkage is 0.8694.
# printing the combination of distance metric and linkage method with the highest cophenetic correlation
print(
    "Highest cophenetic correlation is {}, which is obtained with {} linkage.".format(
        round(high_cophenet_corr, 4), high_dm_lm[1]
    )
)
Highest cophenetic correlation is 0.9423, which is obtained with average linkage.
# list of linkage methods
linkage_methods = ["single", "complete", "average", "centroid", "ward", "weighted"]
# lists to save results of cophenetic correlation calculation
compare_cols = ["Linkage", "Cophenetic Coefficient"]
# to create a subplot image
fig, axs = plt.subplots(len(linkage_methods), 1, figsize=(15, 30))

# enumerate through the list of linkage methods above;
# for each linkage method, plot the dendrogram and annotate the cophenetic correlation
for i, method in enumerate(linkage_methods):
    Z = linkage(subset_scaled_df, metric="euclidean", method=method)
    dendrogram(Z, ax=axs[i])
    axs[i].set_title(f"Dendrogram ({method.capitalize()} Linkage)")
    coph_corr, coph_dist = cophenet(Z, pdist(subset_scaled_df))
    axs[i].annotate(
        f"Cophenetic\nCorrelation\n{coph_corr:0.2f}",
        (0.80, 0.80),
        xycoords="axes fraction",
    )
# build the linkage matrix for the chosen combination (Euclidean distance, average linkage)
Z = linkage(subset_scaled_df, metric="euclidean", method="average")
c, coph_dists = cophenet(Z, pdist(subset_scaled_df))

# fit agglomerative clustering with 5 clusters using the same metric and linkage
hierarchy = AgglomerativeClustering(n_clusters=5, affinity="euclidean", linkage="average")
hierarchy.fit(subset_scaled_df)
AgglomerativeClustering(affinity='euclidean', linkage='average', n_clusters=5)
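Note that the `affinity` parameter of `AgglomerativeClustering` was renamed to `metric` in scikit-learn 1.2 and removed in 1.4, so the cell above only runs on older versions. A minimal version-tolerant sketch (on synthetic stand-in data, since `subset_scaled_df` is built earlier in the notebook):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))  # synthetic stand-in for subset_scaled_df

# scikit-learn 1.2 renamed `affinity` to `metric` (removed in 1.4); try the new name first
try:
    model = AgglomerativeClustering(n_clusters=5, metric="euclidean", linkage="average")
except TypeError:
    model = AgglomerativeClustering(n_clusters=5, affinity="euclidean", linkage="average")
labels = model.fit_predict(X)
print(len(set(labels)))  # prints 5
```

With `n_clusters=5` and at least 5 samples, exactly 5 labels are always produced, so the same downstream profiling code works regardless of which parameter name was accepted.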
df_hierarchy = df.copy()
df_hierarchy.drop("KMeans_clusters", axis=1, inplace=True)
df_hierarchy['HC_clusters'] = hierarchy.labels_
# group the dataset by hierarchical cluster;
# numeric_only=True restricts the mean to numeric columns (required in pandas >= 2.0)
cluster_profile_h = df_hierarchy.groupby("HC_clusters").mean(numeric_only=True)
# add counts for the number of stocks in each cluster
cluster_profile_h["Count"] = (
    df_hierarchy.groupby("HC_clusters")["Current Price"].count().values
)
# show the dataframe with the maximum value of each metric highlighted
cluster_profile_h.style.highlight_max(color="lightblue", axis=0)
| HC_clusters | Current Price | Price Change | Volatility | ROE | Cash Ratio | Net Cash Flow | Net Income | Earnings Per Share | Estimated Shares Outstanding | P/E Ratio | P/B Ratio | Count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 77.884243 | 4.105986 | 1.516865 | 35.320359 | 66.775449 | -32825817.365269 | 1535255703.592814 | 2.903308 | 559027333.145509 | 32.437511 | -1.781988 | 334 |
| 1 | 25.640000 | 11.237908 | 1.322355 | 12.500000 | 130.500000 | 16755500000.000000 | 13654000000.000000 | 3.295000 | 2791829362.100000 | 13.649696 | 1.508484 | 2 |
| 2 | 24.485001 | -13.351992 | 3.482611 | 802.000000 | 51.000000 | -1292500000.000000 | -19106500000.000000 | -41.815000 | 519573983.250000 | 60.748608 | 1.565141 | 2 |
| 3 | 104.660004 | 16.224320 | 1.320606 | 8.000000 | 958.000000 | 592000000.000000 | 3669000000.000000 | 1.310000 | 2800763359.000000 | 79.893133 | 5.884467 | 1 |
| 4 | 1274.949951 | 3.190527 | 1.268340 | 29.000000 | 184.000000 | -1671386000.000000 | 2551360000.000000 | 50.090000 | 50935516.070000 | 25.453183 | -1.052429 | 1 |
Although average linkage gives the highest cophenetic correlation, it lumps nearly all stocks into a single cluster (334 of 340 in cluster 0 above). In contrast, the dendrogram for Ward linkage appears to provide better-separated clusters, with 5 appearing to be the appropriate number of clusters.
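As a cross-check, the same 5-cluster cut can be read directly off the Ward linkage matrix with scipy's `fcluster`, without refitting a scikit-learn estimator. A sketch on synthetic stand-in data (the notebook would use `subset_scaled_df`):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(42)
X = rng.normal(size=(40, 4))  # synthetic stand-in for subset_scaled_df

# build the Ward linkage matrix, then cut the tree so at most 5 flat clusters remain
Z = linkage(X, metric="euclidean", method="ward")
labels = fcluster(Z, t=5, criterion="maxclust")
print(len(np.unique(labels)))  # prints 5
```

`fcluster` numbers its flat clusters from 1 rather than 0, but up to relabeling the assignment matches cutting the dendrogram at the height that leaves 5 branches.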
HCmodel = AgglomerativeClustering(n_clusters=5, affinity="euclidean", linkage="ward")
HCmodel.fit(subset_scaled_df)
AgglomerativeClustering(affinity='euclidean', n_clusters=5)
subset_scaled_df["HC_clusters"] = HCmodel.labels_
df_hierarchy["HC_clusters"] = HCmodel.labels_
# group the dataset by hierarchical cluster;
# numeric_only=True restricts the mean to numeric columns (required in pandas >= 2.0)
cluster_profile_h = df_hierarchy.groupby("HC_clusters").mean(numeric_only=True)
# add counts for the number of stocks in each cluster
cluster_profile_h["Count"] = (
    df_hierarchy.groupby("HC_clusters")["Current Price"].count().values
)
# show the dataframe with the maximum value of each metric highlighted
cluster_profile_h.style.highlight_max(color="lightblue", axis=0)
| HC_clusters | Current Price | Price Change | Volatility | ROE | Cash Ratio | Net Cash Flow | Net Income | Earnings Per Share | Estimated Shares Outstanding | P/E Ratio | P/B Ratio | Count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 326.198218 | 10.563242 | 1.642560 | 14.400000 | 309.466667 | 288850666.666667 | 864498533.333333 | 7.785333 | 544900261.301333 | 113.095334 | 19.142151 | 15 |
| 1 | 84.355716 | 3.854981 | 1.827670 | 633.571429 | 33.571429 | -568400000.000000 | -4968157142.857142 | -10.841429 | 398169036.442857 | 42.284541 | -11.589502 | 7 |
| 2 | 42.848182 | 6.270446 | 1.123547 | 22.727273 | 71.454545 | 558636363.636364 | 14631272727.272728 | 3.410000 | 4242572567.290909 | 15.242169 | -4.924615 | 11 |
| 3 | 72.760400 | 5.213307 | 1.427078 | 25.603509 | 60.392982 | 79951512.280702 | 1538594322.807018 | 3.655351 | 446472132.228456 | 24.722670 | -2.647194 | 285 |
| 4 | 36.440455 | -16.073408 | 2.832884 | 57.500000 | 42.409091 | -472834090.909091 | -3161045227.272727 | -8.005000 | 514367806.201818 | 85.555682 | 0.836839 | 22 |
# print the names of the companies in each cluster
for cl in df_hierarchy["HC_clusters"].unique():
    print("In cluster {}, the following companies are present:".format(cl))
    # Series.unique() returns an ndarray, which has .tolist() (not .to_list())
    print(df_hierarchy[df_hierarchy["HC_clusters"] == cl]["Security"].unique().tolist())
    print()
In cluster 3, the following companies are present:
['American Airlines Group', 'AbbVie', 'Abbott Laboratories', 'Adobe Systems Inc', 'Analog Devices, Inc.', 'Archer-Daniels-Midland Co', 'Ameren Corp', 'American Electric Power', 'AFLAC Inc', 'American International Group, Inc.', 'Apartment Investment & Mgmt', 'Assurant Inc', 'Arthur J. Gallagher & Co.', 'Akamai Technologies Inc', 'Albemarle Corp', 'Alaska Air Group Inc', 'Allstate Corp', 'Applied Materials Inc', 'AMETEK Inc', 'Affiliated Managers Group Inc', 'Ameriprise Financial', 'American Tower Corp A', 'AutoNation Inc', 'Anthem Inc.', 'Aon plc', 'Amphenol Corp', 'Arconic Inc', 'Activision Blizzard', 'AvalonBay Communities, Inc.', 'Broadcom', 'American Water Works Company Inc', 'American Express Co', 'Boeing Company', 'Baxter International Inc.', 'BB&T Corporation', 'Bard (C.R.) Inc.', 'BIOGEN IDEC Inc.', 'The Bank of New York Mellon Corp.', 'Ball Corp', 'Bristol-Myers Squibb', 'Boston Scientific', 'BorgWarner', 'Boston Properties', 'Caterpillar Inc.', 'Chubb Limited', 'CBRE Group', 'Crown Castle International Corp.', 'Carnival Corp.', 'Celgene Corp.', 'CF Industries Holdings Inc', 'Citizens Financial Group', 'Church & Dwight', 'C. H. Robinson Worldwide', 'CIGNA Corp.', 'Cincinnati Financial', 'Comerica Inc.', 'CME Group Inc.', 'Cummins Inc.', 'CMS Energy', 'Centene Corporation', 'CenterPoint Energy', 'Capital One Financial', 'The Cooper Companies', 'CSX Corp.', 'CenturyLink Inc', 'Cognizant Technology Solutions', 'Citrix Systems', 'CVS Health', 'Chevron Corp.', 'Dominion Resources', 'Delta Air Lines', 'Du Pont (E.I.)', 'Deere & Co.', 'Discover Financial Services', 'Quest Diagnostics', 'Danaher Corp.', 'The Walt Disney Company', 'Discovery Communications-A', 'Discovery Communications-C', 'Delphi Automotive', 'Digital Realty Trust', 'Dun & Bradstreet', 'Dover Corp.', 'Dr Pepper Snapple Group', 'Duke Energy', 'DaVita Inc.', 'eBay Inc.', 'Ecolab Inc.', 'Consolidated Edison', 'Equifax Inc.', "Edison Int'l", 'Eastman Chemical', 'Equity Residential', 'EQT Corporation', 'Eversource Energy', 'Essex Property Trust, Inc.', 'E*Trade', 'Eaton Corporation', 'Entergy Corp.', 'Edwards Lifesciences', 'Exelon Corp.', "Expeditors Int'l", 'Expedia Inc.', 'Extra Space Storage', 'Fastenal Co', 'Fortune Brands Home & Security', 'FirstEnergy Corp', 'Fidelity National Information Services', 'Fiserv Inc', 'FLIR Systems', 'Fluor Corp.', 'Flowserve Corporation', 'FMC Corporation', 'Federal Realty Investment Trust', 'First Solar Inc', 'General Dynamics', 'General Growth Properties Inc.', 'Gilead Sciences', 'Corning Inc.', 'General Motors', 'Genuine Parts', 'Garmin Ltd.', 'Goodyear Tire & Rubber', 'Grainger (W.W.) Inc.', 'Halliburton Co.', 'Hasbro Inc.', 'Huntington Bancshares', 'HCA Holdings', 'Welltower Inc.', 'HCP Inc.', 'Hartford Financial Svc.Gp.', 'Harley-Davidson', "Honeywell Int'l Inc.", 'Hewlett Packard Enterprise', 'HP Inc.', 'Hormel Foods Corp.', 'Henry Schein', 'Host Hotels & Resorts', 'The Hershey Company', 'Humana Inc.', 'International Business Machines', 'IDEXX Laboratories', 'Intl Flavors & Fragrances', 'International Paper', 'Interpublic Group', 'Iron Mountain Incorporated', 'Illinois Tool Works', 'Invesco Ltd.', 'J. B. Hunt Transport Services', 'Jacobs Engineering Group', 'Juniper Networks', 'Kimco Realty', 'Kansas City Southern', 'Leggett & Platt', 'Lennar Corp.', 'Laboratory Corp. of America Holding', 'LKQ Corporation', 'L-3 Communications Holdings', 'Lilly (Eli) & Co.', 'Lockheed Martin Corp.', 'Alliant Energy Corp', 'Leucadia National Corp.', 'Southwest Airlines', 'Level 3 Communications', 'LyondellBasell', 'Mastercard Inc.', 'Mid-America Apartments', 'Macerich', "Marriott Int'l.", 'Masco Corp.', 'Mattel Inc.', "McDonald's Corp.", "Moody's Corp", 'Mondelez International', 'MetLife Inc.', 'Mohawk Industries', 'Mead Johnson', 'McCormick & Co.', 'Martin Marietta Materials', 'Marsh & McLennan', '3M Company', 'Altria Group Inc', 'The Mosaic Company', 'Marathon Petroleum', 'Merck & Co.', 'M&T Bank Corp.', 'Mettler Toledo', 'Mylan N.V.', 'Navient', 'NASDAQ OMX Group', 'NextEra Energy', 'Newmont Mining Corp. (Hldg. Co.)', 'Nielsen Holdings', 'Norfolk Southern Corp.', 'Northern Trust Corp.', 'Nucor Corp.', 'Newell Brands', 'Realty Income Corporation', 'Omnicom Group', "O'Reilly Automotive", "People's United Financial", 'Pitney-Bowes', 'PACCAR Inc.', 'PG&E Corp.', 'Public Serv. Enterprise Inc.', 'PepsiCo Inc.', 'Principal Financial Group', 'Procter & Gamble', 'Progressive Corp.', 'Pulte Homes Inc.', 'Philip Morris International', 'PNC Financial Services', 'Pentair Ltd.', 'Pinnacle West Capital', 'PPG Industries', 'PPL Corp.', 'Prudential Financial', 'Phillips 66', 'Quanta Services Inc.', 'Praxair Inc.', 'PayPal', 'Ryder System', 'Royal Caribbean Cruises Ltd', 'Robert Half International', 'Roper Industries', 'Republic Services Inc', 'SCANA Corp', 'Charles Schwab Corporation', 'Sealed Air', 'Sherwin-Williams', 'SL Green Realty', 'Scripps Networks Interactive Inc.', 'Southern Co.', 'Simon Property Group Inc', 'Stericycle Inc', 'Sempra Energy', 'SunTrust Banks', 'State Street Corp.', 'Skyworks Solutions', 'Synchrony Financial', 'Stryker Corp.', 'Molson Coors Brewing Company', 'Tegna, Inc.', 'Torchmark Corp.', 'Thermo Fisher Scientific', 'TripAdvisor', 'The Travelers Companies Inc.', 'Tractor Supply Company', 'Tyson Foods', 'Tesoro Petroleum Co.', 'Total System Services', 'Texas Instruments', 'Under Armour', 'United Continental Holdings', 'UDR Inc', 'Universal Health Services, Inc.', 'United Health Group Inc.', 'Unum Group', 'Union Pacific', 'United Parcel Service', 'United Technologies', 'Varian Medical Systems', 'Valero Energy', 'Vulcan Materials', 'Vornado Realty Trust', 'Verisk Analytics', 'Verisign Inc.', 'Vertex Pharmaceuticals Inc', 'Ventas Inc', 'Wec Energy Group Inc', 'Whirlpool Corp.', 'Waste Management Inc.', 'Western Union Co', 'Weyerhaeuser Corp.', 'Wyndham Worldwide', 'Wynn Resorts Ltd', 'Xcel Energy Inc', 'XL Capital', 'Dentsply Sirona', 'Xerox Corp.', 'Xylem Inc.', 'Yum! Brands Inc', 'Zimmer Biomet Holdings', 'Zions Bancorp', 'Zoetis']

In cluster 0, the following companies are present:
['Alliance Data Systems', 'Alexion Pharmaceuticals', 'Amgen Inc', 'Amazon.com Inc', 'Chipotle Mexican Grill', 'Equinix', 'Facebook', 'Frontier Communications', 'Intuitive Surgical Inc.', 'Monster Beverage', 'Netflix Inc.', 'Priceline.com Inc', 'Regeneron', 'Waters Corporation', 'Yahoo Inc.']

In cluster 1, the following companies are present:
['Allegion', 'Apache Corporation', 'Chesapeake Energy', 'Charter Communications', 'Colgate-Palmolive', 'Kimberly-Clark', 'S&P Global, Inc.']

In cluster 4, the following companies are present:
['Anadarko Petroleum Corp', 'Baker Hughes Inc', 'Cabot Oil & Gas', 'Concho Resources', 'Devon Energy Corp.', 'EOG Resources', 'Freeport-McMoran Cp & Gld', 'Hess Corporation', 'Kinder Morgan', 'Marathon Oil Corp.', 'Murphy Oil', 'Noble Energy Inc', 'Newfield Exploration Co', 'National Oilwell Varco Inc.', 'ONEOK', 'Occidental Petroleum', 'Range Resources Corp.', 'Spectra Energy Corp.', 'Southwestern Energy', 'Teradata Corp.', 'Williams Cos.', 'Cimarex Energy']

In cluster 2, the following companies are present:
['Bank of America Corp', 'Citigroup Inc.', 'Ford Motor', 'Intel Corp.', 'JPMorgan Chase & Co.', 'Coca Cola Company', 'Pfizer Inc.', 'AT&T Inc', 'Verizon Communications', 'Wells Fargo', 'Exxon Mobil Corp.']
# print the number of stocks in each GICS sector for each cluster
for k in range(0, df_hierarchy["HC_clusters"].nunique()):
    print("The number of stocks within each GICS Sector for Cluster " + str(k) + " are:")
    print(df_hierarchy[df_hierarchy["HC_clusters"] == k]["GICS Sector"].value_counts())
    print(" ")
The number of stocks within each GICS Sector for Cluster 0 are:
Health Care                    5
Information Technology         4
Consumer Discretionary         3
Consumer Staples               1
Real Estate                    1
Telecommunications Services    1
Energy                         0
Financials                     0
Industrials                    0
Materials                      0
Utilities                      0
Name: GICS Sector, dtype: int64

The number of stocks within each GICS Sector for Cluster 1 are:
Consumer Staples               2
Energy                         2
Consumer Discretionary         1
Financials                     1
Industrials                    1
Health Care                    0
Information Technology         0
Materials                      0
Real Estate                    0
Telecommunications Services    0
Utilities                      0
Name: GICS Sector, dtype: int64

The number of stocks within each GICS Sector for Cluster 2 are:
Financials                     4
Telecommunications Services    2
Consumer Discretionary         1
Consumer Staples               1
Energy                         1
Health Care                    1
Information Technology         1
Industrials                    0
Materials                      0
Real Estate                    0
Utilities                      0
Name: GICS Sector, dtype: int64

The number of stocks within each GICS Sector for Cluster 3 are:
Industrials                    52
Financials                     44
Consumer Discretionary         35
Health Care                    34
Information Technology         27
Real Estate                    26
Utilities                      24
Materials                      19
Consumer Staples               15
Energy                          7
Telecommunications Services     2
Name: GICS Sector, dtype: int64

The number of stocks within each GICS Sector for Cluster 4 are:
Energy                         20
Information Technology          1
Materials                       1
Consumer Discretionary          0
Consumer Staples                0
Financials                      0
Health Care                     0
Industrials                     0
Real Estate                     0
Telecommunications Services     0
Utilities                       0
Name: GICS Sector, dtype: int64
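The per-cluster `value_counts` above can also be produced in a single table with `pd.crosstab`, which is easier to scan across clusters. A sketch on a hypothetical mini version of `df_hierarchy` with just the two columns used here:

```python
import pandas as pd

# hypothetical mini stand-in for df_hierarchy (only the two columns used above)
toy = pd.DataFrame(
    {
        "HC_clusters": [0, 0, 1, 1, 2, 2, 2],
        "GICS Sector": ["Energy", "Financials", "Energy", "Energy",
                        "Utilities", "Energy", "Financials"],
    }
)
# one row per cluster, one column per sector, stock counts in the cells
sector_counts = pd.crosstab(toy["HC_clusters"], toy["GICS Sector"])
print(sector_counts)
```

In the notebook, `pd.crosstab(df_hierarchy["HC_clusters"], df_hierarchy["GICS Sector"])` would give the same counts as the loop above, including the zero cells, in one DataFrame.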
# show boxplots of numerical variables for each hierarchical cluster
fig, axes = plt.subplots(3, 4, figsize=(20, 20))
counter = 0
for ii in range(3):
    for jj in range(4):
        if counter < 11:
            sns.boxplot(
                ax=axes[ii][jj],
                data=df_hierarchy,
                y=df_hierarchy.columns[3 + counter],
                x="HC_clusters",
                palette="viridis",
            )
        counter = counter + 1
fig.tight_layout(pad=3.0)
Cluster 0 - Growth for a Price
Cluster 1 - Short-term Poor, Long-term Rich
Cluster 2 - DJIA
Cluster 3 - Diversification
Cluster 4 - Energy-specific portfolio
When comparing the two clustering techniques, several questions are worth addressing: Which technique took less time to execute? Did the two techniques produce distinct clusters, or essentially the same ones, and how many observations fall in the corresponding clusters of both algorithms? How many clusters did each algorithm suggest as the appropriate number? Finally, what differences or similarities appear between the cluster profiles obtained from the two techniques?
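The execution-time and cluster-overlap comparisons can be sketched directly in code. The snippet below uses synthetic stand-in data (the notebook would use `subset_scaled_df`) and times both algorithms, then cross-tabulates their labels; a large mass on few cells of the crosstab means the two techniques grouped mostly the same observations together:

```python
import time
import numpy as np
import pandas as pd
from sklearn.cluster import AgglomerativeClustering, KMeans

rng = np.random.default_rng(7)
X = rng.normal(size=(200, 4))  # synthetic stand-in for subset_scaled_df

# time K-Means
start = time.perf_counter()
km_labels = KMeans(n_clusters=5, n_init=10, random_state=7).fit_predict(X)
km_time = time.perf_counter() - start

# time agglomerative (hierarchical) clustering
start = time.perf_counter()
hc_labels = AgglomerativeClustering(n_clusters=5, linkage="ward").fit_predict(X)
hc_time = time.perf_counter() - start

print(f"KMeans: {km_time:.4f}s, Agglomerative: {hc_time:.4f}s")
# cross-tabulate the two label sets: each cell counts observations assigned
# to a given (K-Means cluster, hierarchical cluster) pair
print(pd.crosstab(km_labels, hc_labels, rownames=["KMeans"], colnames=["Hierarchical"]))
```

Absolute timings depend on the machine, but on a dataset of this size both algorithms finish in well under a second, so cluster quality and interpretability matter more than speed here.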
Based on the analysis conducted for Trade&Ahead, the following actionable insights and recommendations can be made:
Identify Client's Financial Goals and Risk Tolerance: Before recommending any specific portfolio, Trade&Ahead should thoroughly understand the financial goals and risk tolerance of their clients. This will enable them to tailor their recommendations according to individual client needs.
Recommend Clusters as Potential Portfolios: Once the client's financial goals and risk tolerance have been identified, Trade&Ahead can recommend clusters of stocks that align with these requirements. These clusters should consist of stocks that have similar characteristics and exhibit patterns that match the client's investment behaviors.
Consider Standard Indexes as Alternatives: While recommending clusters is one approach, Trade&Ahead should also consider the possibility that many of these clusters essentially act as substitutes for standard indexes like the Dow Jones Industrial Average and the S&P 500. In cases where these standard indexes can more easily achieve the client's goals, Trade&Ahead may advise investing in index funds or ETFs rather than creating a custom cluster.
Perform Financial Statement Analysis: Instead of relying solely on the recommended clusters, Trade&Ahead can utilize them as a starting point for further analysis. Specifically, they can conduct a thorough financial statement analysis of the individual stocks within each cluster.
Identify Outliers and Deviations: During the financial statement analysis, Trade&Ahead should pay close attention to stocks that do not fit the profile of the cluster or exhibit significant deviations from the expected patterns. These stocks may present opportunities or risks that can impact the overall performance of the portfolio.
Recommend Individual Stocks: If selecting individual stocks aligns with the client's investment strategy, Trade&Ahead can leverage the insights gained from financial statement analysis to identify specific stocks that are likely to outperform their peers or underperform them.
Buy or Sell Recommendations: Based on the identified stocks' characteristics and expected performance, Trade&Ahead can provide buy or sell recommendations to their clients. These recommendations should be aligned with the client's financial goals, risk tolerance, and the analysis conducted on individual stocks.
By following these recommendations, Trade&Ahead can provide personalized investment advice to their clients while considering both the cluster-based approach and the unique characteristics of individual stocks. This approach will help clients achieve their financial goals while managing risk effectively.